I will be the first to admit I know nothing about wine. I am hoping that learning about the physicochemical properties of good wines will help me choose good wines to gift my friends, instead of my usual method: ‘oooo, what a pretty label!’. When I do drink wine I tend to lean towards the sweeter whites, which is why I decided to explore the white wine data.
There are 4898 observations with 13 variables as detailed below.
The following variables are being analyzed:

* fixed acidity (tartaric acid - g / dm^3)
* volatile acidity (acetic acid - g / dm^3) - high levels give a vinegar-like quality.
* citric acid (g / dm^3) - adds ‘freshness’ and flavor to wines
* residual sugar (g / dm^3) - sweet wines have >45 g/dm^3
* chlorides (sodium chloride - g / dm^3) - amount of salt
* free sulfur dioxide (mg / dm^3) - dissolved SO2 gas. Prevents microbial growth and oxidation of wine; becomes evident in nose and taste at >50 mg / dm^3.
* total sulfur dioxide (mg / dm^3) - contains both free and fixed forms.
* density (g / cm^3) - changes depending on alcohol and sugar content. Note: density of water = 1, alcohol <1, sugar >1
* pH - 0 (very acidic) to 14 (very basic). Most wines are 3-4
* sulphates (potassium sulphate - g / dm^3) - creates SO2. Antimicrobial and antioxidant
* alcohol (% by volume)

Output variable (based on sensory data):

* quality (score between 0 and 10) - 0 (very bad) to 10 (excellent)
* rating (category) - Categorical classification of the quality score. Poor wines scored between 0-3, average wines scored between 4-6, and excellent wines scored between 7-10.
    + Note: I added this column to the table in python, because I am more familiar with python for data wrangling efforts.
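The rating column was added in Python, but the same binning could be sketched in R with `cut()`. The `quality` vector below is made up for illustration; only the 0-3 / 4-6 / 7-10 bins come from the description above.

```r
# Hypothetical quality scores, not taken from the dataset
quality <- c(3, 4, 6, 7, 9)

# Bin scores into (-Inf, 3], (3, 6], (6, Inf): poor, average, excellent
rating <- cut(quality,
              breaks = c(-Inf, 3, 6, Inf),
              labels = c("poor", "average", "excellent"))

table(rating)
```

Because `cut()` closes each interval on the right by default, a score of exactly 3 lands in "poor" and a score of exactly 6 lands in "average".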
The background text explains that these observations are for white variants of the Portuguese “Vinho Verde” wine. The quality scores are not evenly distributed (there are a lot more average wines than poor or excellent ones). The authors also weren’t sure whether every variable is relevant, so I guess that is for me to figure out!
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality rating
## Min. :3.000 average :3818
## 1st Qu.:5.000 excellent:1060
## Median :6.000 poor : 20
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Whoa, there is a massive outlier in residual sugar. The max is 65.8, while the 75th percentile is only 9.9. According to the documentation, ‘wines with greater than 45 grams/liter are considered sweet’, and even with my limited wine knowledge I know that people have their preferences for dry or sweet wine. I’m going to break the data set up into dry wines (sugar <45) and sweet wines (sugar >45).
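A minimal sketch of that split, assuming a data frame with a `residual.sugar` column. The tiny `wine` data frame and the names `dry_wine` and `sweet_wine` are stand-ins for illustration, not the objects used in this analysis.

```r
# Toy stand-in for the wine data frame; only a handful of rows
wine <- data.frame(residual.sugar = c(1.7, 5.2, 9.9, 31.6, 65.8),
                   quality        = c(5, 6, 7, 6, 6))

sweet_cutoff <- 45  # g/dm^3, from the dataset documentation

# Strict inequalities on both sides, matching the description above
dry_wine   <- subset(wine, residual.sugar < sweet_cutoff)
sweet_wine <- subset(wine, residual.sugar > sweet_cutoff)

nrow(dry_wine)
nrow(sweet_wine)
```

With strict inequalities a wine at exactly 45 would fall into neither group, so one side would need `<=` or `>=` if such a value existed.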
## [1] "dry wine summary"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2781 Mean :0.3341 Mean : 6.379
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :31.600
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0103 Max. :3.820 Max. :1.0800 Max. :14.20
## quality rating
## Min. :3.000 average :3817
## 1st Qu.:5.000 excellent:1060
## Median :6.000 poor : 20
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
## [1] "sweet wine summary"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :7.8 Min. :0.965 Min. :0.6 Min. :65.8
## 1st Qu.:7.8 1st Qu.:0.965 1st Qu.:0.6 1st Qu.:65.8
## Median :7.8 Median :0.965 Median :0.6 Median :65.8
## Mean :7.8 Mean :0.965 Mean :0.6 Mean :65.8
## 3rd Qu.:7.8 3rd Qu.:0.965 3rd Qu.:0.6 3rd Qu.:65.8
## Max. :7.8 Max. :0.965 Max. :0.6 Max. :65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.074 Min. :8 Min. :160 Min. :1.039
## 1st Qu.:0.074 1st Qu.:8 1st Qu.:160 1st Qu.:1.039
## Median :0.074 Median :8 Median :160 Median :1.039
## Mean :0.074 Mean :8 Mean :160 Mean :1.039
## 3rd Qu.:0.074 3rd Qu.:8 3rd Qu.:160 3rd Qu.:1.039
## Max. :0.074 Max. :8 Max. :160 Max. :1.039
## pH sulphates alcohol quality rating
## Min. :3.39 Min. :0.69 Min. :11.7 Min. :6 average :1
## 1st Qu.:3.39 1st Qu.:0.69 1st Qu.:11.7 1st Qu.:6 excellent:0
## Median :3.39 Median :0.69 Median :11.7 Median :6 poor :0
## Mean :3.39 Mean :0.69 Mean :11.7 Mean :6
## 3rd Qu.:3.39 3rd Qu.:0.69 3rd Qu.:11.7 3rd Qu.:6
## Max. :3.39 Max. :0.69 Max. :11.7 Max. :6
Okay so there is only one sweet wine, so I can’t really do any analysis on that. I will move forward with only the dry wines.
Interesting points:

- Citric acid has a minimum value of 0.0, which after a little bit of research is okay. Not all wines have citric acid.
- Volatile acidity (VA), sugar, free sulfur dioxide (free SO2), density, and sulphates have outliers on the high end.
- Fixed acidity (FA), total SO2, citric acid (CA), and pH have outliers on both ends.

Let’s investigate these outliers and see if they are all from the same wine. If so, we can safely remove it from our analysis.
We can’t disregard the outliers, because no single wine has all of them. The outlier set has 3940 observations, so almost every wine has an outlying value of some sort. In short, there are lots of outliers in every variable except alcohol and chlorides. So I will zoom into the data using coord_cartesian, which preserves the underlying values while plotting the range where the majority of the data sits.
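The write-up does not say how the outliers were flagged, so the sketch below assumes the usual boxplot rule (more than 1.5 × IQR beyond the quartiles). The helper `is_outlier` and the `toy` data frame are hypothetical.

```r
# Flag values more than 1.5 * IQR outside the quartiles (boxplot rule; an assumption)
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Invented values: column a has one extreme point, column b has none
toy <- data.frame(a = c(1, 2, 3, 4, 100),
                  b = c(10, 11, 12, 13, 14))

flags <- sapply(toy, is_outlier)   # logical matrix, one column per variable
any_outlier <- rowSums(flags) > 0  # TRUE if the row is an outlier in any variable

sum(any_outlier)                   # number of rows with at least one outlier
```

Counting rows with `rowSums(flags) > 0` is what makes the outlier set grow so quickly: a wine only needs one unusual value in any column to be included.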
Fixed acidity was normally distributed, no transformation needed. Let’s look at the rating break down.
All the ratings peak at around the same point and still keep a generally normal distribution, so FA does not look like a factor. Let’s run some t-tests to be sure.
##
## Welch Two Sample t-test
##
## data: poor_wines$fixed.acidity and avg_wines$fixed.acidity
## t = 1.8485, df = 19.049, p-value = 0.08011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0942196 1.5209422
## sample estimates:
## mean of x mean of y
## 7.600000 6.886639
##
## Welch Two Sample t-test
##
## data: poor_wines$fixed.acidity and exc_wines$fixed.acidity
## t = 2.2642, df = 19.143, p-value = 0.03536
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06655039 1.68316659
## sample estimates:
## mean of x mean of y
## 7.600000 6.725142
##
## Welch Two Sample t-test
##
## data: avg_wines$fixed.acidity and exc_wines$fixed.acidity
## t = 5.9055, df = 1845.3, p-value = 4.174e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1078635 0.2151310
## sample estimates:
## mean of x mean of y
## 6.886639 6.725142
Well, I retract my previous statement. There is a significant difference in mean fixed acidity between excellent wines and both average and poor wines, but no significant difference between the poor and average means.
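The output above comes from R’s `t.test()`, which runs the Welch (unequal variance) version by default. Here is a small sketch on invented numbers; `poor_fa` and `exc_fa` are hypothetical vectors, not the `poor_wines` and `exc_wines` data.

```r
# Made-up fixed acidity values for two rating groups
poor_fa <- c(7.1, 7.6, 8.0, 7.4, 7.9)
exc_fa  <- c(6.8, 6.9, 6.7, 7.0, 6.6, 6.9)

# var.equal = FALSE is the default and gives the Welch test
res <- t.test(poor_fa, exc_fa, var.equal = FALSE)

res$method
res$conf.int        # 95 percent confidence interval for the difference in means
res$p.value < 0.05  # reject "no difference in means" at the 5% level?
```

The Welch test does not assume equal variances or equal group sizes, which matters here because the poor group is tiny compared with the other two.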
Volatile acidity was right-skewed (a long tail of high values), so let’s see if a transformation makes it normal:
That did the trick, so to compare VA with the other variables I first need to take its square root. I chose the square root over log10 because the data didn’t have enough spread for a logarithmic scale. The square root did the job nicely!
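One way to put a number on "did the trick" is to compare a skewness measure before and after the transform. This is only a sketch: the `skewness` helper is a plain moment-based estimate written for this example, and the `va` values are invented.

```r
# Moment-based sample skewness; 0 means symmetric, > 0 means a long right tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Invented volatile-acidity-like values with a long right tail
va <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.5, 0.8, 1.1)

skewness(va)
skewness(sqrt(va))   # closer to 0 than the raw values
skewness(log10(va))
```

Whichever transform brings the value closest to 0 without overshooting is a reasonable candidate, though a histogram is still worth a look.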
VA looks like a factor in determining wine quality: the three graphs have roughly the same shape, but they peak at slightly different values. Let’s run t-tests to see if the differences are significant.
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_VA and poor_wines$trans_VA
## t = -1.6805, df = 19.12, p-value = 0.1091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09740153 0.01062695
## sample estimates:
## mean of x mean of y
## 0.5228504 0.5662377
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_VA and exc_wines$trans_VA
## t = 5.0034, df = 1696.3, p-value = 6.217e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.009415845 0.021557703
## sample estimates:
## mean of x mean of y
## 0.5228504 0.5073637
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_VA and exc_wines$trans_VA
## t = 2.2712, df = 19.431, p-value = 0.03468
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.00469983 0.11304830
## sample estimates:
## mean of x mean of y
## 0.5662377 0.5073637
There is a significant difference in the mean of the square root of volatile acidity between excellent wines and both poor and average wines. However, there is no significant difference between poor and average wines. Still, there is some significance, so VA is a factor!
Citric acid is normally distributed, so the mean is an appropriate average and no transformation is necessary.
Looks like we have peaks at different levels, so let’s see if they are significantly different!
##
## Welch Two Sample t-test
##
## data: avg_wines$citric.acid and poor_wines$citric.acid
## t = 0.02026, df = 19.511, p-value = 0.984
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03793915 0.03868214
## sample estimates:
## mean of x mean of y
## 0.3363715 0.3360000
##
## Welch Two Sample t-test
##
## data: avg_wines$citric.acid and exc_wines$citric.acid
## t = 3.1807, df = 2759.8, p-value = 0.001486
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.003955953 0.016673831
## sample estimates:
## mean of x mean of y
## 0.3363715 0.3260566
##
## Welch Two Sample t-test
##
## data: poor_wines$citric.acid and exc_wines$citric.acid
## t = 0.54095, df = 19.703, p-value = 0.5946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02843636 0.04832315
## sample estimates:
## mean of x mean of y
## 0.3360000 0.3260566
There is only a difference between average and excellent wines, so citric acid is a faint contender.
Residual sugar is definitely skewed. Let’s try a transform!
The sqrt transform still gave a skewed graph. With a log10 transformation I see that the residual sugar is bimodal. Let’s see how the different ratings fit into that.
Poor and average wines have more wines coming in at the higher sugar peak, while excellent wines have more wines with lower sugar contents. Sounds like a contender! We can’t do a t-test since the data is bimodal.
The chloride levels are skewed, let’s transform!
Look at the beautiful normal curve. The sqrt transformation did the trick!
Looks like the peaks are at slightly different places. T-test time!
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_C and avg_wines$trans_C
## t = 0.49846, df = 19.069, p-value = 0.6239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02520204 0.04096329
## sample estimates:
## mean of x mean of y
## 0.2226194 0.2147388
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_C and exc_wines$trans_C
## t = 1.8405, df = 19.103, p-value = 0.08128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.003981942 0.062205054
## sample estimates:
## mean of x mean of y
## 0.2226194 0.1935078
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_C and exc_wines$trans_C
## t = 20.005, df = 2621.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01914986 0.02331201
## sample estimates:
## mean of x mean of y
## 0.2147388 0.1935078
There is a significant difference in the mean chloride levels of average and excellent wines. Like citric acid, chlorides might be a determining factor between average and excellent wines. I also created a new variable (trans_C), which is the sqrt of the chloride amount. I will be using this value in further investigations.
Free SO2 is slightly skewed, let’s transform!
The sqrt transformation did it! Looking nice and normal. Let’s see the ratings distribution.
Average and excellent wines have about the same distribution, but the poor wines have multiple peaks. Free SO2 isn’t a major determining factor for wine quality, but I might want to explore the poor wine free SO2 levels with other variables. Let’s do a t-test to be sure.
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_fSO2 and avg_wines$trans_fSO2
## t = 0.52386, df = 19.028, p-value = 0.6064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.374811 2.292887
## sample estimates:
## mean of x mean of y
## 6.224715 5.765677
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_fSO2 and exc_wines$trans_fSO2
## t = 0.52731, df = 19.063, p-value = 0.6041
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.372183 2.296729
## sample estimates:
## mean of x mean of y
## 6.224715 5.762442
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_fSO2 and exc_wines$trans_fSO2
## t = 0.075356, df = 2111.9, p-value = 0.9399
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08096016 0.08743069
## sample estimates:
## mean of x mean of y
## 5.765677 5.762442
There is no significant difference in the sqrt of the free SO2 levels between any of the ratings. NOT a contender!
Total SO2 is skewed, so let’s transform it!
Much better! A sqrt transform was the answer. Let’s check out the ratings.
Whoa, it looks like poor wines have more total SO2 than average and excellent wines. Let’s do a t-test and see!
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_TSO2 and avg_wines$trans_TSO2
## t = 0.68249, df = 19.04, p-value = 0.5031
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.327322 2.612061
## sample estimates:
## mean of x mean of y
## 12.40105 11.75868
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_TSO2 and exc_wines$trans_TSO2
## t = 1.3856, df = 19.086, p-value = 0.1818
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6655997 3.2755208
## sample estimates:
## mean of x mean of y
## 12.40105 11.09609
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_TSO2 and exc_wines$trans_TSO2
## t = 12.226, df = 2146.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5563108 0.7688717
## sample estimates:
## mean of x mean of y
## 11.75868 11.09609
Well, that wasn’t what I expected! There is a significant difference between the average and excellent means of the sqrt of total SO2 levels, but no significant difference between poor wines and the other ratings. Still a contender.
Density is slightly skewed, but it is so tightly clustered that I don’t think there will be much difference between the ratings. I’m not gonna transform this one due to the tightness of the data. Let’s check the ratings.
Hmmm, there are some different peaks, let’s do some t.tests for kicks and giggles.
##
## Welch Two Sample t-test
##
## data: poor_wines$density and avg_wines$density
## t = 0.66862, df = 19.196, p-value = 0.5117
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0009029601 0.0017515312
## sample estimates:
## mean of x mean of y
## 0.9948840 0.9944597
##
## Welch Two Sample t-test
##
## data: poor_wines$density and exc_wines$density
## t = 3.8708, df = 19.694, p-value = 0.0009742
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.001138508 0.003805530
## sample estimates:
## mean of x mean of y
## 0.994884 0.992412
##
## Welch Two Sample t-test
##
## data: avg_wines$density and exc_wines$density
## t = 21.226, df = 1708.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.001858512 0.002236955
## sample estimates:
## mean of x mean of y
## 0.9944597 0.9924120
Spoke too soon: there are significant differences between the mean density of excellent wines and that of poor or average wines. No difference between poor and average, though.
The pH is almost normally distributed. Let’s check those ratings!
It looks like poor wine has a slightly higher (less acidic) pH than the other ratings. t-tests to confirm!
##
## Welch Two Sample t-test
##
## data: poor_wines$pH and avg_wines$pH
## t = 0.14352, df = 19.099, p-value = 0.8874
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09155584 0.10504156
## sample estimates:
## mean of x mean of y
## 3.187500 3.180757
##
## Welch Two Sample t-test
##
## data: poor_wines$pH and exc_wines$pH
## t = -0.58582, df = 19.404, p-value = 0.5647
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.12621679 0.07095264
## sample estimates:
## mean of x mean of y
## 3.187500 3.215132
##
## Welch Two Sample t-test
##
## data: avg_wines$pH and exc_wines$pH
## t = -6.3777, df = 1617.8, p-value = 2.34e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04494674 -0.02380313
## sample estimates:
## mean of x mean of y
## 3.180757 3.215132
Looks like there is a significant difference between the average and excellent wines, but not between poor and any other!
Sulphates are slightly right-skewed, so let’s transform them!
Much better. sqrt transformation did the trick. Let’s check out the ratings:
Looking pretty similar, but let’s do the t-tests just to be sure!
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_sulp and avg_wines$trans_sulp
## t = -0.52255, df = 19.153, p-value = 0.6073
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05039506 0.03025001
## sample estimates:
## mean of x mean of y
## 0.6837169 0.6937894
##
## Welch Two Sample t-test
##
## data: poor_wines$trans_sulp and exc_wines$trans_sulp
## t = -0.9055, df = 19.812, p-value = 0.3761
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05817715 0.02297222
## sample estimates:
## mean of x mean of y
## 0.6837169 0.7013194
##
## Welch Two Sample t-test
##
## data: avg_wines$trans_sulp and exc_wines$trans_sulp
## t = -2.4669, df = 1484.4, p-value = 0.01374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.013517393 -0.001542489
## sample estimates:
## mean of x mean of y
## 0.6937894 0.7013194
Glad I checked. There is a significant difference between the mean of the sqrt of the sulphate levels for average and excellent wines, but not for poor wines.
Alcohol is very right-skewed, with a peak at about 9.5% by volume (mean = 10.51 and median = 10.4). I’m going to see if plotting the log10 of the alcohol content will normalize the graph.
Neither the log nor the sqrt helped. The log10() gives multiple peaks, though a generally normal shape. I’m not going to apply a transformation here, since neither one normalized the data. Let’s see if grouping the original data by rating gives any insights.
There are definitely some different distributions. Let’s run the t-tests and find out how significant these differences are.
##
## Welch Two Sample t-test
##
## data: poor_wines$alcohol and avg_wines$alcohol
## t = 0.29377, df = 19.161, p-value = 0.7721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4931970 0.6543541
## sample estimates:
## mean of x mean of y
## 10.34500 10.26442
##
## Welch Two Sample t-test
##
## data: poor_wines$alcohol and exc_wines$alcohol
## t = -3.8747, df = 19.761, p-value = 0.0009603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.6480638 -0.4939803
## sample estimates:
## mean of x mean of y
## 10.34500 11.41602
##
## Welch Two Sample t-test
##
## data: avg_wines$alcohol and exc_wines$alcohol
## t = -27.118, df = 1539.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.234898 -1.068304
## sample estimates:
## mean of x mean of y
## 10.26442 11.41602
No significant difference between the mean alcohol content of poor and average wine, but there is a significant difference between excellent and poor or average wine.
It is clear to see that we have a lot more average wines than any other rating. At least excellent wines are visible, while poor wines just aren’t even a contender.
wine_qual has 4898 observations and originally 12 variables. 11 of the variables are based on physicochemical tests on the wine and one (quality) is based on sensory input from wine critics. White wine can be either sweet or dry, so I wanted to separate the two. It turns out there is only one sweet wine (residual sugar >45 g/L), so I just excluded it from my data set. I created a categorical variable, rating, which assigns a rating to a group of quality scores. Poor wines scored between 0 and 3, average wines scored between 4 and 6, and excellent wines scored between 7 and 10. There are 20 poor wines, 3817 average wines, and 1060 excellent wines. I also performed transformations on the following variables to normalize their data: volatile acidity, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all received a sqrt transformation. Sugar received a log10 transformation.
I want to figure out which chemical properties lead to a higher wine rating. Pretty much every chemical property yielded a statistically significant difference in mean values between average and excellent wines. The only exception was free sulfur dioxide. However, only a few yielded statistically significant differences in mean values between poor and excellent wines. These variables are fixed acidity, the sqrt of volatile acidity, density, and alcohol. I will investigate these variables further in the bivariate plot section.
Even though free sulfur dioxide didn’t yield statistically significant differences in mean values between the ratings, it has an interesting property. If the free sulfur dioxide levels are above 50 ppm they are detectable, so it would be interesting to see how detectable free sulfur dioxide levels affect the ratings. Residual sugar also had an interesting result. When I did a logarithmic transformation on it, it turned out to be bimodal. The ratings reflected this distribution as well, but excellent wines had the larger peak at lower sugar amounts, while poor and average wines had the larger peak at higher sugar amounts.
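A quick way to look at detectable versus undetectable free SO2 by rating is a cross-tabulation. The sketch below uses invented values; `free_so2`, `rating`, and `detectable` are hypothetical names, and only the threshold of 50 comes from the variable description.

```r
# Invented free SO2 levels (mg/dm^3) and ratings
free_so2 <- c(23, 34, 46, 52, 75, 18)
rating   <- c("average", "excellent", "average", "average", "poor", "excellent")

detectable <- free_so2 > 50  # threshold from the variable description

# Count wines per rating on each side of the threshold
counts <- table(rating, detectable)
counts
```

Passing the table to `prop.table(counts, margin = 1)` would turn the counts into within-rating proportions, which is easier to compare when the groups are this unbalanced.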
I created a factor variable called ‘rating’ to give qualitative meaning to the ‘quality’ measure. In the next section I plan on using these ratings to see what range each rating has in each of the physiochemical properties.
I performed sqrt transformations on volatile acidity, chlorides, free and total sulfur dioxide, and sulphates. I did this transformation to normalize the data, so I can compare it to other data down the line. I also performed a log10 transformation on residual sugar to normalize it, but I ended up with a bimodal distribution. The sqrt transformation didn’t help normalize sugar either, so it’ll be interesting to see what behavior shows up when comparing other variables to sugar!
dry_wine_qual2 will be my main dataframe for the bivariate analysis. I deleted the original forms of the transformed variables (volatile acidity, chlorides, total sulfur dioxide, sugar and sulphates). I also deleted the free.sulfur.dioxide column and its transformed column because there wasn’t a significant difference in its mean values between ratings. I also deleted the quality column because I’m going to use my rating column to perform my analysis, not the specific numerical scores.
I need to reorder my rating factor levels (right now they are alphabetical), and I want to take a closer look at the boxplots for quality and the other variables. I don’t really need to investigate density, since the values are so tightly clustered around 0.99.
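Reordering the levels can be done by rebuilding the factor with an explicit `levels` argument. A minimal sketch on a made-up vector:

```r
# Factor levels default to alphabetical order
rating <- factor(c("average", "excellent", "poor", "average"))
levels(rating)

# Rebuild the factor with the levels in quality order
rating <- factor(rating, levels = c("poor", "average", "excellent"))
levels(rating)
```

Only the level order changes; the values themselves are untouched, so plots and tables simply list poor, average, excellent from left to right.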
No surprise that sugar and alcohol have a strong relationship with density. I’m surprised chlorides didn’t, since salt can greatly affect the density of a liquid. Most other relationships were weakly correlated. We are interested in investigating the physicochemical traits that result in a high quality rating, so let’s start with the histograms and boxplots.
##
## average excellent poor
## 3817 1060 20
I already know that the differences are significant, and it seems that the median fixed acidity levels decrease as the ratings get better. Also there are a lot of outliers.
A note on all the t-tests done in the previous sections: they have to be unpaired t-tests because each rating is a separate group of wines with a different number of observations. Also, there are not many poor wines, which could be why there is no significant difference between the poor and average ratings.
Again, I know that the difference between the means is statistically significant. The medians don’t show a particular pattern, but looking at the table of means so far, the mean of the sqrt of volatile acidity decreases as the rating improves.
There are quite a few outliers in the average and excellent ratings. I already know the difference of the means is significant, so it appears that as the rating improves the density decreases.
There are a few outliers for the average rating, but it is clear that the higher rating has the higher alcohol content.
## ratings FA_means TVA_means D_means AL_means
## 1 poor 7.600000 0.5662377 0.9948840 10.34500
## 2 average 6.886639 0.5228504 0.9944597 10.26442
## 3 excellent 6.725142 0.5073637 0.9924120 11.41602
The t-tests for these four variables reveal that the differences in their mean values are significant. Findings:

* Fixed acidity (FA) decreases as rating increases.
* The sqrt of the volatile acidity decreases as rating increases.
* Density decreases as rating increases.
* Alcohol content is about the same for poor and average wines, but it is higher for excellent wines.
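A per-rating table of means like the one above could be built with `tapply()`. This is a sketch on invented numbers; the `toy` data frame is hypothetical and only borrows the `AL_means` name from the table.

```r
# Invented alcohol values for a few wines in each rating
toy <- data.frame(
  rating  = factor(c("poor", "average", "average", "excellent", "excellent"),
                   levels = c("poor", "average", "excellent")),
  alcohol = c(10.3, 10.0, 10.6, 11.2, 11.6)
)

# Mean alcohol content within each rating, in factor-level order
AL_means <- tapply(toy$alcohol, toy$rating, mean)
AL_means
```

Repeating the call for each variable and binding the results with `data.frame()` would give one row per rating and one column per variable.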
Looking forward to exploring these variables in more depth in the next section! Only explored these variables because they allowed us to reject the null hypothesis of no difference between the means of the values for each rating.
Let’s now take a look at sugar. I want to explore sugar because it had a bimodal distribution over all and for each rating. The excellent wines had a higher peak at lower sugar levels than poor or average wines.
Sugar also has a strong relationship with density and a weak relationship with total sulfur dioxide.
## [1] "poor wine log10 sugar summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1549 0.2007 0.6628 0.6233 1.0293 1.2095
## [1] "average wine log10 sugar summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2304 0.7782 0.6615 1.0170 1.4997
## [1] "excellent wine log10 sugar summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.09691 0.25527 0.58826 0.57646 0.86923 1.28443
There is no clear pattern here: average wines have a higher mean than both poor and excellent wines, and excellent wines have a lower mean than poor wines. Maybe some patterns will emerge when sugar is compared to other variables.
Here we can see that excellent wines peak at a lower sugar level, while average and poor wines have their higher peak at higher sugar levels. I would say sugar level isn’t a major determining factor, since all three ratings share peaks at about the same places, but it is clear that more excellent wines have a lower sugar amount than average or poor wines.
The transformed sugar and density had a strong correlation. As the sugar increases, so does the density. That makes sense, since density is derived from the components in the liquid. There is a bit of an oddball point out there. I definitely want to add ratings to this chart to see if there are more patterns.
As the amount of total sulfur dioxide increases, the amount of residual sugar increases as well. Can’t wait to see what the ratings will tell us! Note: I used dry_wine_qual because I hadn’t added the transformed sugar variable to dry_wine_qual2.
Total sulfur dioxide and density have a moderate relationship, as shown in the plot: as the amount of total sulfur dioxide increases, the density increases.
As the alcohol content increases, the density decreases. I chose to look at this graph because it has one of the strongest correlations in the ggpairs plot. I definitely want to look at this plot with ratings and see if there are any patterns.
I investgated the four chemical properties that displayed statistically significant differences for their mean values between the ratings. These chemical properties are the fixed acidity, the square root of the volatile acidity, the density, and the alchol content. I found that the mean value for all of them except alcohol content decreases as the rating increased. For alcohol content then mean increases as the rating increases.
I also looked at the relationships that displayed some amount of correlation on the ggpairs summary. I looked at density with , alcohol, sugar, and the square root of the total amount of sulfur dioxide. The density decreases when alcohol increases, density increases when sugar and total sulfur dioxide increases. Another relationship I looked at was the sqrt of total sulfur dioxide and sugar. The data points seem to be in two big clusters, so I really want to add ratings to that graph to see if any patterns show up.
density was strongly related to sugar and alcohol. This isn’t suprising since the density of a liquid is determined by the components in the liquid. More sugar suggests a higher density which makes sense since the density of sugar is greater than 1. More alcohol suggests a lower density which also makes sense since the density of alcohol is less than one.
the sqrt of free sulfur dioxide didn’t yeild any significant differences bewteen its mean values for the different ratings. I will use this to build plots with my four variables of interest (fixed acidity, sqrt of volatile acidity, density, and alcohol) and add rating as a color to see if any patterns arise.
The vertical red line represents the detectable threshold for free sulfur dioxide. the horizontal line represents the mean value for that y-value. I chose to smooth the average and the excellent ratings because they have so many more points than the poor rating, that it was impossible to see the poor ratings at all. Also the poor ratings were just all over the map, so there really isn’t any clear relationship betwen our four variables and a poor rating.
Fixed acidity: for undetectable levels of free SO2, average wines contain more fixed acidity. For detectable levels of free SO2, excellent wines generally contain more fixed acidity.
Volatile acidity: also known as the spoilage factor (http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity). Generally, average wines contain more volatile acidity than excellent wines (negative slope).
Density: average wines have a higher density than excellent wines.
Alcohol: excellent wines have a higher alcohol content than average wines.
The black lines represent the means for each variable. Excellent wines tend to have higher than average alcohol content and lower than average densities.
Excellent wines tend NOT to have higher than average volatile acidity and higher than average densities.
The black lines represent the averages for each value. Excellent wines tend to have lower than average density and fixed acidity.
Excellent wines have higher than average alcohol content. No conclusion for fixed acidity, though.
As the alcohol content of excellent wines increases, the amount of volatile acid increases.
No clear patterns here.
Findings so far: excellent wines have lower than average densities and higher than average alcohol content. These are great features to know for excellent wine because you can find them on the label (density is mass / volume). The other two variables didn’t provide many insights.
Let’s check out our other bivariate graphs that we wanted to add another layer to!
So, as expected, the higher alcohol content corresponds to the lower densities. The red line represents the trend for excellent wines, and the yellow line represents the trend for average wines. At lower sugar levels, excellent wines tend to have more alcohol and a lower density than average wines. However, at higher sugar levels (about 15 g/dm^3) they follow the same trends.
I was hoping for more. There are no clusters for ratings, alcohol, or density, so I settled on looking at the trends for excellent and average wines. They both generally follow the same curve. No insights here.
I already knew that excellent wines tend to have a lower density. It also seems like excellent wines tend to have lower than average levels of total sulfur dioxide.
Because both volatile acidity and total sulfur dioxide had a sqrt transformation, I can compare the original values and get the same relationship. I wanted to look at this relationship because I was surprised that volatile acidity wasn’t more of a factor in determining the quality of wine. Volatile acidity is the measure of wine spoilage (http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity), and is a by-product of microbial metabolism. Sulfur dioxide is an antimicrobial additive to wine and it can reduce the amount of volatile acidity in wine. Interestingly enough, it seems like the sulfur dioxide is doing a more effective job in excellent wines (VA decreases with more SO2), and not so much in average and poor wines (the VA increases as the SO2 increases!). I need to calculate the correlations for each rating to see if this is a strong, moderate, or weak relationship.
## [1] "r values for poor wines, avg_wines, and excellent wines respectively"
## [1] 0.2410981 0.1102411 -0.1000730
Okay, so these are weak correlations at best, or barely any at all. This just means that a linear fit is not a good choice for a model. I’m still going to go off the trends because I think this is an interesting find!
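For anyone who wants to redo the per-rating r values outside of R, here is a rough sketch of the calculation in plain Python. It assumes the plain Pearson correlation, computed separately for each rating group. The column names (`so2`, `va`) and the toy rows are hypothetical and are not taken from the wine data.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


def r_by_rating(rows, x_col, y_col):
    """Return {rating: Pearson r between x_col and y_col within that rating}."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row["rating"]]
        xs.append(row[x_col])
        ys.append(row[y_col])
    return {rating: pearson_r(xs, ys) for rating, (xs, ys) in groups.items()}


# Made-up rows for illustration only; NOT values from the wine data.
toy_rows = [
    {"rating": "average", "so2": 100, "va": 0.20},
    {"rating": "average", "so2": 150, "va": 0.30},
    {"rating": "average", "so2": 200, "va": 0.25},
    {"rating": "excellent", "so2": 100, "va": 0.30},
    {"rating": "excellent", "so2": 150, "va": 0.28},
    {"rating": "excellent", "so2": 200, "va": 0.22},
]

print(r_by_rating(toy_rows, "so2", "va"))
```

The sign of r gives the direction of the trend within each group, and the size tells you how tightly the points follow a straight line. A small r does not rule out a relationship that isn’t linear.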
There was only one sweet wine, so I only compared dry wines (residual sugar <45 g/L). The main features of interest that I investigated were the fixed acidity, the sqrt of the volatile acidity, the density, and the alcohol content.
Initially I had rejected density as a factor because it was so tightly clustered, but it turns out that density is a clear determining factor for excellent wines. The alcohol content also gave clear clusterings for the ratings. Excellent wines tend to have a lower density than average wines, and they tend to have a higher alcohol content. Fixed acidity and volatile acidity didn’t yield any decisive conclusions like density and alcohol did.
I was surprised that high volatile acid levels didn’t really affect the ratings. Volatile acid is acetic acid, the same type of acid that is in vinegar. I would have expected larger amounts of VA to decrease the quality of the wine. After doing a bit of research (http://waterhouse.ucdavis.edu/whats-in-wine), I learned that mixing in sulfur dioxide is a way to decrease the amount of volatile acid. When we plot volatile acid versus total sulfur dioxide, it turns out that excellent wines have lower VA amounts at higher SO2 levels, but poor and average wines do the opposite! Their VA amounts increase as their SO2 levels increase.
I chose to look at a density plot instead of a histogram because the sample sizes for each rating are not even. There are 20 poor wines, 3817 average wines, and 1060 excellent wines. By doing a density plot, we can compare each group on an equal level. When I was trying to normalize the residual sugar histogram, I realized that it became bimodal under the log10 transformation. In the above plot we can see that the distribution for each rating is also bimodal, but excellent wines have a higher density peak at the lower sugar level, while poor and average wines have higher peaks at the higher sugar levels.
I had begun to think that there wasn’t any one factor that really affected the rating of a wine until I got this graph. From this graph, it is clear that the density of the wine decreases as the alcohol content increases. All three ratings show that relationship. But we can also see a clustering of the excellent ratings at the higher alcohol contents. From this graph, we can say that excellent wines tend to have higher alcohol content.
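Here is a small sketch of why scaling to a density puts uneven groups on an equal footing. It uses a normalized histogram as a simpler stand-in for the smoothed density curve in the plot: each bin count is divided by (group size × bin width), so every group’s bars cover a total area of 1. The function name, bin edges, and toy values are assumptions for illustration only.

```python
def density_histogram(values, edges):
    """Return per-bin densities so that sum(density * bin_width) == 1.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last_bin = i == len(counts) - 1
            # Bins are [low, high), except the last bin, which includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last_bin and v == edges[-1]):
                counts[i] += 1
                break
    n = sum(counts)
    if n == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


# Made-up groups with the same shape but very different sizes.
edges = [0.0, 1.5, 3.0]
small_group = [1.0, 1.0, 2.0]                # 3 points
large_group = [1.0] * 200 + [2.0] * 100      # 300 points

print(density_histogram(small_group, edges))
print(density_histogram(large_group, edges))
```

Raw counts would make the large group tower over the small one. After the scaling, two groups with the same shape give the same bar heights no matter how many points each has.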
This project taught me a lot about data exploration. I realize that having a little bit of knowledge about your data subject is super helpful. I have pretty much no experience with white wine, so I went into this exploration without any preconceived ideas about what makes an excellent wine. I do know science, and I have also tasted really acidic, salty, or sweet foods, so I know that tolerance and taste preferences play a huge part in how a food gets rated. I was glad that each wine was rated by at least three different critics, so each quality score was an average and not just one person’s opinion. Because I was trying to answer a question that is inherently subjective, I found it really difficult to pinpoint exactly which factors answer that question. I was able to narrow it down from 10 to four after the univariate analysis. Then I narrowed it down to two, alcohol content and density. Residual sugar was definitely of interest, since the sugar level is what determines if a wine is sweet or dry. There was only one sweet (sugar >45 g/L) wine, so I only did my analysis on the dry whites. I know that people prefer certain levels of sweetness in their wines, so I wasn’t surprised that it wasn’t a decisive factor like density or alcohol.
Another interesting factor was the free sulfur dioxide (SO2) level. SO2 is an antimicrobial added to wine to preserve it. It is mainly used to prevent the buildup of volatile acidity (acetic acid) in the wine. Free SO2 levels are generally undetectable, but once they exceed 50 ppm (or 50 mg/L, to keep it within the context of the given units), they are detectable by taste and smell. While SO2 wasn’t really a determining factor for quality, it showed some interesting behavior. Excellent wines showed decreased levels of volatile acidity for increased levels of SO2, while average and poor wines showed increased levels of volatile acidity for increased levels of SO2! It seems like the mix in excellent wines allows the SO2 to do its thing!
Overall, I felt that since I was in the dark about the subject matter, I wanted to explore everything! I love the ggpairs function because it allowed me to take a snapshot of different pairs and decide which ones I really wanted to play with. I was honestly shocked that the quality of the wine seems to boil down to alcohol percentage and density. But in doing some research for this project at my local supermarket, I realized that they are the only things you can know about a wine from the label. So even if I had found another relationship, I wouldn’t be able to use it without a lab! On the flip side, since I wasn’t seeing any relationships, I felt like I was going down a rabbit hole with no end in sight! But that is the nature of EDA.
While I prefer Python for data wrangling, R is SO much nicer for visualizations. I love adding layers, and figuring out how to incorporate different factors in different ways was really fun. I love that I can have four different factors on one graph (one per axis, one for color, and one for size). Talk about powerful information!
If I had more time to really go down the rabbit hole, I would want to try to build a model for wine quality. It was fascinating to see how transforming one factor by taking a square root or log could change the correlations. Before transforming some variables, I barely saw any correlation between the variables, but the transformations really helped bring some patterns to light.
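As a toy illustration of how a transformation can change a correlation, the sketch below compares the Pearson r between two made-up variables before and after a log10 transform. The data are constructed so that `y` is exactly linear in log10(x). This is an artificial best case and not something taken from the wine data, and the helper name is my own.

```python
from math import log10, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: y grows by 1 every time x grows tenfold.
x = [1, 10, 100, 1000, 10000]
y = [1, 2, 3, 4, 5]

r_raw = pearson_r(x, y)                      # correlation on the original scale
r_log = pearson_r([log10(v) for v in x], y)  # correlation after log10(x)

print("raw:", round(r_raw, 3), "log10:", round(r_log, 3))
```

Pearson r only measures straight-line association, so a curved relationship can look weak until the curve is straightened out. Real data will rarely line up this cleanly.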
So next time I’m choosing a white wine from this brand, I stand a good chance of choosing an excellent wine if it has an alcohol content greater than 10.5% and a density less than 0.994 g/cm^3.
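That rule of thumb is easy to write down as a tiny check. This is only a screening heuristic using the two cutoffs from my plots, not a fitted model, and the function and constant names are made up for illustration.

```python
# Cutoffs read off the exploratory plots in this write-up.
ALCOHOL_MIN_PCT = 10.5     # % by volume
DENSITY_MAX_G_CM3 = 0.994  # g/cm^3


def likely_excellent(alcohol_pct, density_g_cm3):
    """Return True when a wine clears both cutoffs (strict inequalities)."""
    return alcohol_pct > ALCOHOL_MIN_PCT and density_g_cm3 < DENSITY_MAX_G_CM3


# Hypothetical bottles, not rows from the data set.
print(likely_excellent(12.0, 0.991))
print(likely_excellent(9.5, 0.997))
```

Plenty of average wines will also pass this check, so it improves the odds and doesn’t guarantee anything.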